We describe a binding schema markup language (BSML) for describing data interchange between scientific codes.
Such a facility is an important constituent of scientific problem solving environments (PSEs). BSML is designed
to integrate with a PSE or application composition system that views model specification and execution as a
problem of managing semistructured data. The data interchange problem is addressed by three techniques for
processing semistructured data: validation, binding, and conversion. We present BSML and describe its
application to a PSE for wireless communications system design.


1 Introduction

Problem solving environments (PSEs) are high-level software systems for doing computational science. A simple
example of a PSE is the Web PELLPACK system [?] that addresses the domain of partial differential equations
(PDEs). Web PELLPACK allows the scientist to access the system through a Web browser, define PDE problems,
choose and configure solution strategies, manage appropriate hardware resources (for solving the PDE), and visualize
and analyze the results. The scientist thus communicates with the PSE in the vernacular of the problem, 'not in the
language of a particular operating system, programming language, or network protocol' [16]. It is 10 years since
the goal of creating PSEs was articulated by an NSF workshop (see [?] for findings and recommendations). From
providing high-level programming interfaces for widely used software libraries [?], PSEs have now expanded to
diverse application domains such as wood-based composites design [18], aircraft design [17], gas turbine dynamics
simulation [15], and microarray bioinformatics [?].

The basic functionalities expected of a PSE include supporting the specification, monitoring, and coordination 
of extended problem solving tasks. Many PSE system designs employ the compositional modeling paradigm, where 
the scientist describes data-flow relationships between codes in terms of a graphical network and the PSE manages 
the details of composing the application represented by the network. Compositional modeling is not restricted to 
such model specification and execution but can also be used as an aid in performance modeling of scientific codes [?]
(model analysis).

We view model specification and execution as a data management problem and describe how a semistructured
data model can be used to address data interchange problems in a PSE. Section 1.1 presents a motivating PSE
scenario that will help articulate needs from a data management perspective. Section 2 elaborates on these ideas and
briefly reviews pertinent related work. In particular, it identifies three basic levels of functionality (validation,
binding, and conversion) at which data interchange in application composition can be studied. Sections 3, 4, and 5
describe our specific contributions along these dimensions, in the form of a binding schema markup language (BSML).
Section 6 outlines how these ideas can be integrated within an existing PSE system design. A concluding
discussion is provided in Section 7. Aspects of the scenario described next will be used throughout this paper as running
examples.

*The work presented in this paper is supported in part by National Science Foundation grants EIA-9974956, EIA-9984317,
and EIA-0103660.

1.1 Motivating Example 

S4W (Site-Specific System Simulator for Wireless system design) is a PSE being developed at Virginia Tech. S4W
provides deterministic electromagnetic propagation and stochastic wireless system models for predicting the
performance of wireless systems in specific environments, such as office buildings. S4W is also designed to support the
inclusion of new models into the system, visualization of results produced by the models, integration of
optimization loops around the models, validation of models by comparison with field measurements, and management of the
results produced by a large series of experiments. S4W permits a variety of usage scenarios. We will describe one
scenario in detail.

A wireless design engineer uses S4W to study transmitter placement in an indoor environment located on the
fourth floor of Durham Hall at Virginia Tech. The engineering goal is to achieve a certain performance objective 
within the given cost constraints. For a narrowband system, power levels at the receiver locations are good indicators 
of system performance. Therefore, minimizing the (spatial) average shortfall of received power with respect to 
some power threshold is a meaningful and well defined objective. The major cost constraints are the number of 
transmitters and their powers. Different transmitter locations and powers yield different levels of coverage. The 
situation is more complicated in a wideband system, but roughly the same process applies. A wideband system 
includes extra hardware not present in a narrowband system and the performance objective is formulated in terms of 
the bit error rate (BER), not just the power level. 
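To make the narrowband objective concrete, the following sketch computes a spatial average shortfall of received power below a threshold. It is only an illustration under stated assumptions: the function name, the choice of dBm as the unit, and the sample values are hypothetical and are not taken from S4W.

```python
def average_shortfall(received_dbm, threshold_dbm):
    """Mean shortfall of received power below a threshold, in dB.

    Receiver locations at or above the threshold contribute zero.
    """
    if not received_dbm:
        raise ValueError("at least one receiver location is required")
    shortfalls = [max(0.0, threshold_dbm - p) for p in received_dbm]
    return sum(shortfalls) / len(shortfalls)


# Hypothetical power levels (dBm) at four receiver locations.
powers = [-70.0, -85.0, -95.0, -60.0]
print(average_shortfall(powers, threshold_dbm=-80.0))
```

An optimizer would minimize this value over transmitter locations and powers; locations that already meet the threshold do not offset locations that fall short.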

The first step in this scenario is to construct a model of signal propagation through the wireless communications 
channel. S4W provides ray tracing as the primary mechanism to model site-specific propagation effects such as
transmission (penetration), reflection, and diffraction. The second step is to take into account antenna parameters 
and system resolution. These two steps are often sufficient to model the performance of a narrowband system. 
If a wideband system is being considered, the third step is to configure the specific wireless system. Parameters 
such as the number of fingers of the rake receiver and forward error correction codes are considered at this step.
S4W provides a Monte-Carlo simulation of a WCDMA (wideband code division multiple access) family of wireless
systems. In either case, the engineer configures a graph of computational components as shown in Fig. 1. The ovals
correspond to computational components drawn from a mix of languages and environments. Hexagons enclose 
input and output data. Aggregation is used to simplify the interfaces of the components to each other and to the 
optimizer. In Fig. 1, rectangles represent aggregation. The propagation model is a component that consists of three
connected subcomponents: triangulation, space partitioning, and ray tracing. Similarly, the wireless system model 
consists of (roughly) three components: data encoding, channel modeling, and signal decoding. All three steps are 
further aggregated into a complete site-specific system model. This model is then used in an optimization loop. 
The optimizer changes transmitter parameters (all other parameters remain fixed) and receives feedback on system 
performance. 

For a given environment definition in AutoCAD, the triangulation and space partitioning components are used
to reduce the number of geometric intersection tests that will be performed by the ray tracer. Several iterations
over space partitioning are necessary to achieve acceptable software performance. However, once the objective (an
average of ten triangles per voxel) is met, the space partitioning can be reused in all future experiments with this
environment. The engineer then configures the ray tracer to only capture reflection and transmission (penetration)
effects. Although diffraction and scattering are important in indoor propagation [?], these phenomena are
computationally expensive to model in an optimization loop. The triangulation and space partitioning codes are meant for
serial execution, whereas the ray tracer and the Monte Carlo wireless system models run on a 200 node Beowulf
cluster of workstations. Post processing is available in both serial and parallel versions. The ray tracer and the post
processor are written in C, whereas the WCDMA simulation is available in Matlab and Fortran 95 versions.

Figure 1: A site-specific system model in S4W. The system model consists of a propagation model, an antenna
model (post processing), and a wireless system model.

Figure 2: Optimizing placement of three transmitters to cover eighteen rooms and a corridor bounded by the box
in the upper left corner. The bounds for the placement of three transmitters are drawn with dotted lines. The initial
transmitter positions are marked with crosses. The optimum coverage transmitter positions are marked with dots.

A series of experiments is performed for various choices of antenna patterns, path loss parameters (influenced by 
material properties), and WCDMA system parameters. The predicted power delay profiles (PDPs) are then compared 
with the measurements from a channel sounder and the predicted bit error rates are compared with the published 
data. The parameters of the propagation model are calibrated for various locations. The validated propagation and 
wireless system models are finally enclosed in an optimization loop to determine the locations of transmitters that 
will provide adequate performance for a region of interest. The optimizer, written in Fortran 95, uses the Dividing 
RECTangles (DIRECT) algorithm of Jones et al. [19]. The parameters to the optimization problem and the optimal 
transmitter placement are depicted in Fig. 2. The optimizer decided to move the transmitter in the upper right corner
one room to the right of its initial position and the transmitter in the lower left corner two rooms to the right of its 
initial position. 

What requirements can we abstract from this scenario and how can they be flexibly supported by a data model? 
We first observe the diversity in the computational environment. Component codes are written in different languages 
and some of them are meant for parallel execution. In a research project such as S4W, many components are under
active development, so their I/O specifications change over time. Second, the interconnection among components
is also flexible. Optimizing for power coverage and optimizing for bit error rate, while having similar motivations,
require different topologies of computational components. Third, since different groups of researchers are involved 
in the project, there exists significant cognitive discordance among vocabularies, data formats, components, and 
even methodologies. For example, ray tracing models represent powers in a power delay profile in dBm (log scale). 
However, WCDMA models work with a normalized linear scale impulse response and an aggregate called the 
'energy-to-noise ratio.' Also, there is more than one way of calculating the energy-to-noise ratio. Since antennas 
generate noise that depends on their parameters, detailed antenna descriptions are necessary to calculate this ratio. 
However, researchers who are not concerned with antenna design seldom model the system at this level of detail. 
The typical practice is to use a fixed noise level in the calculations. Simulations of wireless systems abound in such 
approximations, ad hoc conversions, and simplifying assumptions. 
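The mismatch between a log-scale power delay profile and a normalized linear-scale representation can be illustrated with a small conversion filter. The dBm-to-milliwatt formula is standard; the normalization to unit total power and both function names are assumptions made for this sketch, since the exact convention used by the WCDMA models is not specified here.

```python
def dbm_to_milliwatts(p_dbm):
    """Convert a power level from dBm (log scale) to milliwatts (linear scale)."""
    return 10.0 ** (p_dbm / 10.0)


def normalize_profile(powers_dbm):
    """Convert a power delay profile to linear scale with unit total power.

    The unit-total-power normalization is an assumption of this sketch.
    """
    linear = [dbm_to_milliwatts(p) for p in powers_dbm]
    total = sum(linear)
    return [p / total for p in linear]


# Two multipath components of equal power share the total equally.
print(normalize_profile([-10.0, -10.0]))
```

Each output value depends only on the profile at hand, so a filter of this kind can sit inside a data-flow between two components.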



2 PSE Requirements for Data Interchange 

Culling from the above scenario, we arrive at a more formal list of data interchange requirements for application 
composition in a PSE. The PSE must support: 

1. components in multiple languages (C, FORTRAN, Matlab, SQL); 

2. changes in component interfaces; 

3. changes in interconnections among components; 

4. automatic unit conversion in data-flows; 

5. user-defined conversion filters; 

6. composition of components with slightly different interfaces; and 

7. stream processing. 

The reader might be surprised that SQL is listed alongside FORTRAN, but both languages are used in S4W.
Experiment simulations are written in procedural languages, while experiment data is stored in a relational database.
Thus, developing a system that integrates with the PSE environment requires more than the ability to link scientific
computing languages. It involves overcoming the impedance mismatch between languages developed for
fundamentally different purposes.

The last requirement above is related to composability, the ability to create arbitrary component topologies. As
data interchange is pushed deeper into the computation, the unit of data granularity needs to become correspondingly
smaller. The optimization loop is a good example of fine data granularity. We cannot accumulate all transmitter
parameters over all iterations and later convert them to the format required by the simulation inside the loop, because
transmitter parameters generated by the optimizer depend on the feedback computed by the simulation. Each block
of transmitters must be processed as soon as it is available. Likewise, each value of the objective function must be
made available to the optimizer before it can produce the next block of transmitters. Usability dictates a similar
requirement. Since some models are computationally expensive (e.g., those meant for parallel execution),
incremental feedback should be provided to the user as early as possible. The stream processing requirement improves
composability and usability, but limits conversions to being local. Global conversions (e.g., XSLT [13]) cannot be
performed because they assume that all the data is available at once.
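The dependency between the optimizer and the simulation can be sketched with a Python generator: the optimizer cannot produce its next block of parameters until the objective value for the previous block has been sent back. This is a toy illustration only. The step rule is a stand-in and is not the DIRECT algorithm, and `simulate`, `optimizer`, and `run` are hypothetical names.

```python
def simulate(block):
    """Stand-in for the site-specific system model: one scalar objective value."""
    return sum((x - 3.0) ** 2 for x in block)


def optimizer(initial, steps):
    """Toy optimizer that yields one block of parameters at a time.

    Execution pauses at each `yield` until the objective value for that
    block is supplied, so blocks cannot be batched ahead of time.
    """
    best, best_value = None, None
    candidate = list(initial)
    for _ in range(steps):
        value = yield candidate
        if best_value is None or value < best_value:
            best, best_value = candidate, value
        candidate = [x + 0.5 for x in best]
    return best, best_value


def run(initial, steps):
    """Drive the loop: each block is simulated as soon as it is available."""
    opt = optimizer(initial, steps)
    try:
        block = next(opt)
        while True:
            block = opt.send(simulate(block))
    except StopIteration as stop:
        return stop.value


print(run([1.0], steps=4))
```

Any conversion placed between `optimizer` and `simulate` sees only one block at a time, which is why conversions in such a loop must be local.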

While the requirements point to a semistructured data model, no currently available data management system 
supports all forms of PSE functionality. This paper presents the prototype of such a system in the form of a markup 
language. Observe that all of the above requirements are summarized by three standard techniques for working 
with semistructured data — validation, binding, and conversion. Validation establishes data conformance to a given 
schema. It is a prerequisite to most of the requirements. Binding refers to integrating semistructured data with 
languages that were designed for different purposes (requirement 1). Conversion (transformation) takes care of
requirements 2-6. Given two slightly different schemas, it is possible to generate an edit script [11] that converts
data instances from one schema to another. Requirement 7 dictates that all such conversions must be local. 



2.1 Related Work 

While research in PSEs covers a broad territory, the use of semistructured data representations in computational 
science is not established beyond a few projects. Therefore, we only survey standard XML technologies and
PSE-like systems that make (some) use of semistructured data. It would be unfair to review some of these systems against
PSE data interchange requirements. Instead, our evaluation is based on how well these systems support validation, 
binding, conversion, and stream processing. 

Specific XML technologies for document processing are easy to classify in terms of our framework. Schema
languages (e.g., RELAX NG [12]) deal with validation and, possibly, binding. Transformation languages (e.g.,
XSLT [13]) deal with conversion. Several properties of these technologies hinder their direct applicability to a PSE 
setting. First and foremost, these technologies do not work with streams of data. Sophisticated schema constraints 
and complex transformations can require buffering the whole document before producing any output. Second, 
transformation languages are simply vehicles for applying edit scripts. They cannot be used to create edit scripts. 
Since our conversions are local, edit script application is trivial, but edit script creation is not. 

Four major flavors of PSE-like projects that use semistructured data representations can be identified: 

1. component metadata projects; 

2. workflow projects; 

3. scientific data interchange projects; and 

4. scientific data management projects. 



Projects in the first category use XML to store IDL-like (interface definition language) component descriptions
and miscellaneous component execution parameters. An example of such a project is CCAT [?], which is a
distributed object oriented system. CCAT also uses XML for message transport between components, so we say that
it provides an OO binding. The second category of projects augments component metadata with workflow
specifications. For example, GALE [8] is a workflow specification language for executing simulations on distributed
systems. Unlike CCAT, GALE provides XML specifications for some common types of experiments, such as
parameter sweeps (CCAT uses a scripting language for workflow specification). However, GALE does not use XML
for component data. Both the component metadata and workflow projects use XML to encode data that is not
semistructured. Their use of XML is not dictated by the need for automatic conversion. Neither generic binding
mechanisms nor conversion are provided by these projects.

The latter two groups of projects use XML for application data, not component metadata. Representatives of
the scientific data interchange group develop flexible all-encompassing schemas for specific application domains.
For example, CACTUS [?] deals with spatial grid data. CACTUS's schema is complex enough to be considered
semistructured and this project recognizes the need for conversion filters. However, it does not provide multiple
language support and, more importantly, does not accommodate changes in the schema. CACTUS's conversion
filters aim at code reuse, not change management. This project has OO binding and manual conversion (the sequence
of conversions is not determined automatically). Complexity of the data format precludes stream processing.

Perhaps the most relevant group of projects for our purposes involves the scientific data management community.
Especially interesting are the projects in rapidly evolving domains, such as bioinformatics. DataFoundry [?]
provides a unifying database interface to diverse bioinformatics sources. Both the data and the schema of these sources
evolve quickly, so DataFoundry has to deal with change management that is far more complex
than the kind we consider here. However, DataFoundry only provides mediators for database access. It does not
integrate with simulation execution. This system takes full advantage of conversion, but provides only an SQL
binding. Introducing bindings for procedural languages would involve significant changes to DataFoundry.

Table 1: A survey of PSE-like systems and XML technologies. The binding row shows that most systems
support only one paradigm. Only DataFoundry fully supports conversion. Other systems either provide a library of
conversion primitives and leave their composition up to the user (CACTUS) or do not recognize the need for
conversion at all (CCAT). No system or technology fully supports validation, binding, and conversion. Most systems
and technologies cannot dynamically process streams of data.

Table 1 summarizes related work. It turns out that no known PSE-like system takes full advantage of both
binding and conversion. XML technologies for validation and binding are well established, but XML transformation 
technologies do not support PSE-style conversion. Very few systems can integrate with a PSE execution environment 
because most of them do not meet the stream processing requirement. This paper develops a system that satisfies 
all of our data interchange requirements. The next three sections describe our handling of validation, binding, and 
conversion. System integration is outlined in Section 6.

3 Validation 

Validation establishes conformance of a data instance to a given schema. It is a prerequisite to binding and
conversion. (This definition of validation is a small part of the process of validation in a PSE, which is concerned with
the larger issue of a model being appropriate to solve a given problem; but, it suffices for the purpose of this paper.) 
The schemas for PSE data are easy to obtain since computational science traditionally uses rigid data structures, 
not loosely formatted documents. Describing the data structures in terms of schemas has several benefits. First, 
language-neutral schemas allow for interoperability between different languages (see requirement 1 in the previous 
section). Second, schemas facilitate database storage and retrieval. Third, appropriate schemas help assign
interpretations to various data fields. It is such interpretation that makes automatic conversion possible (requirements 2-6).

What kind of validation is appropriate for PSE data? Requirement 7 calls for the most expressive schema 
language that can be parsed by a stream parser. In other words, we are looking for a schema language that can be 
defined in terms of an LL(1) grammar. (The LR family of grammars is more expressive, but LR parsers do not
follow stream semantics.) Therefore, a predictive parser generated for a given schema can validate a data instance. 
This section describes a schema language (BSML) appropriate for a PSE and the steps for building a parser generator 
for this language. We present an example, an informal overview of BSML features, and a formal definition for a large 
subset of BSML in terms of a context-free grammar. Further, predictive parser generation is outlined and grammar 
transformations specific to BSML are described in detail. Finally, we show that BSML is strictly less expressive 
than LL(1) grammars. 

Let us start with an example. Figures 3 and 4 depict a (simplified) schema for an octree environment
decomposition. (Fig. 3 describes it in XML notation while Fig. 4 uses a non-XML format that will be useful for describing
some functionalities of BSML.) This is the most complex schema in S4W, not counting the schema for the schema
language itself. An octree consists of internal and leaf nodes that delimit groups of triangles. Recall from Section 1.1
that this grouping is used to limit the intersection tests in ray tracing. The nested structure of an octree maps nicely
into an XML tree. Since many components work with lists of triangles, there is a separate schema for a list of
triangles. As the example shows, the features of BSML closely resemble those of other schema languages, such as
RELAX NG. The only noticeable difference is the presence of units in the definitions of primitive types. Units will
be useful for certain types of conversions. Figure 5 shows an LL(1) grammar generated from the octree schema.
This grammar is then annotated with binding code and used to generate a parser for octree data. The parser can be
linked with a parallel ray tracer written in C.

The DTD for the current version of BSML is given in Appendix A. The schema language describes primitive
types and schemas. There are four base primitive types: integer, string, (IEEE) double, and boolean. Users can 
derive their own primitive types by range restriction. User-derived types usually have domain-specific flavor, such 
as coordinates and distances in the example above. We do not support more complicated primitive types, such 
as dates and lists, because each PSE component treats them differently. Schemas consist of four building blocks: 
elements, sequences, selections, and repetitions. Strictly speaking, repetitions can be expressed as selections and 
sequences, but they are so common that they deserve special treatment. Derivation of schemas by restriction is 
not supported, but derivation by extension can be implemented via inter-schema references. Mixed content is not 
supported because it is only used for documentation. Instead, BSML supports a wildcard content type. Content
of this type matches anything and is delivered to the component as a DOM tree [?]. We do not support referential
integrity constraints because they can delay binding and thus break requirement 7. There is no explicit construct for 
interleaves. In some ways, interleaves are handled by the conversion algorithm. In other words, BSML is a simple 
schema language that incorporates most common features that are useful in a PSE. 
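A check of character data against a primitive type derived from double might look like the sketch below. The meaning given to the `number` and `finite` flags (rejecting NaN and rejecting infinities, respectively) is our reading of the attribute names and is an assumption; the function name and error messages are hypothetical, and units are ignored because they matter for conversion rather than validation.

```python
import math


def check_double(text, minimum=None, maximum=None, number=True, finite=True):
    """Parse `text` as a double and enforce range-restriction constraints.

    Assumed semantics: number=True rejects NaN, finite=True rejects
    infinities, and minimum/maximum are inclusive bounds.
    """
    value = float(text)  # raises ValueError if text is not a float literal
    if number and math.isnan(value):
        raise ValueError("NaN is not allowed")
    if finite and math.isinf(value):
        raise ValueError("infinite values are not allowed")
    if minimum is not None and value < minimum:
        raise ValueError(f"{value} is below the minimum {minimum}")
    if maximum is not None and value > maximum:
        raise ValueError(f"{value} is above the maximum {maximum}")
    return value


# A hypothetical non-negative 'distance' value.
print(check_double("2.5", minimum=0.0))
```

Because the check looks at a single value, it can run as soon as the character data arrives and does not delay binding.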

Parser generation for a BSML schema follows the standard steps from compiler textbooks [?]:

1. convert the schema to an LL(1) grammar, 

2. eliminate empty productions and self-derivations, 

3. eliminate left recursion, 

4. perform left factoring, 

5. perform miscellaneous cleanup (described in detail below), 

6. compute a predictive parsing table, and 

7. generate parsing code from the table. 

The only steps specific to this schema language are generating an LL(1) grammar (step 1) and miscellaneous
cleanup (step 5). Since grammars have been in use for a long time, it is pertinent to define BSML semantics in terms
of how the schemas are converted to grammars. The terminals are defined by SAX events [10]. The start of element
and end of element events are denoted s(name) and e(name), respectively, where name is the element name. We omit
the attributes for simplicity, but BSML supports them in an obvious way. Further, we assume that the SAX parser
inlines external entity references. Character data is accumulated until the next start of element or end of element
event and delivered as a d(base, min, max, number, finite, units) terminal, abbreviated as d (see Appendix A for
d's attributes). Generated code checks character data conformance to the type constraints. This definition of d is
appropriate since BSML does not support selections based on the type of character data.
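The mapping from SAX events to the terminals s(name), e(name), and d can be sketched with Python's standard `xml.sax` module. This is an illustration, not the BSML implementation: the class name, the tuple encoding of terminals, and the decision to drop whitespace-only character data are assumptions of the sketch, and attributes are omitted as in the text.

```python
import xml.sax


class TerminalStream(xml.sax.ContentHandler):
    """Turn SAX events into ('s', name), ('e', name), and ('d', text) terminals."""

    def __init__(self, emit):
        super().__init__()
        self._emit = emit  # callback invoked once per terminal, in document order
        self._text = []

    def _flush(self):
        # Character data is accumulated until the next start/end event.
        data = "".join(self._text).strip()
        self._text = []
        if data:
            self._emit(("d", data))

    def startElement(self, name, attrs):
        self._flush()
        self._emit(("s", name))

    def endElement(self, name):
        self._flush()
        self._emit(("e", name))

    def characters(self, content):
        self._text.append(content)


def terminals(xml_bytes):
    """Collect the terminals of a small document into a list (for demonstration)."""
    out = []
    xml.sax.parseString(xml_bytes, TerminalStream(out.append))
    return out


print(terminals(b"<a><b>1.5</b><b> 2 </b></a>"))
```

In a streaming setting the callback would hand each terminal to the predictive parser directly instead of appending it to a list.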

One root non-terminal is initially generated for each schema block (element, sequence, selection, repetition),
each reference to a primitive type, and each string of user code. We denote non-terminals by capital letters, the
start non-terminal by S, the empty string by ε, and the root non-terminals generated for the children of each schema
block by X_1, X_2, ..., X_n, n > 0. Further, lower-case Greek letters denote (possibly empty) sequences of terminals,
non-terminals, and, in the next section, user codes. With this notation in mind, the definition of BSML is in Figure 6
(more details follow in future sections). We slightly deviate from a context-free grammar to allow for the constraints
on the number of repetitions (see next section). To reiterate, a grammar generated from a schema according to this
definition will undergo several standard equivalence transformations before a grammar of the form shown in Figure 5
is obtained.
<type id='distance' base='double' number='true' finite='true'/>
<type id='coordinate' base='double' number='true' finite='true'/>

<schema id='triangles'>
  <repetition>
    <element name='tr'>
      <repetition min='3' max='3'>
        <element name='v'>
          <attribute name='x' type='coordinate' units='m'/>
          <attribute name='y' type='coordinate' units='m'/>
          <attribute name='z' type='coordinate' units='m'/>
        </element>
      </repetition>
    </element>
  </repetition>
</schema>

<schema id='octree'>
  <element name='octree'>
    <element name='oi' id='oi'>
      <attribute name='x' type='coordinate' units='m'/>
      <attribute name='y' type='coordinate' units='m'/>
      <attribute name='z' type='coordinate' units='m'/>
      <attribute name='dx' type='distance' units='m'/>
      <attribute name='dy' type='distance' units='m'/>
      <attribute name='dz' type='distance' units='m'/>
      <ref id='triangles'/>
      <repetition>
        <selection>
          <ref id='oi'/>
          <element name='ol'>
            <attribute name='x' type='coordinate' units='m'/>
            <attribute name='y' type='coordinate' units='m'/>
            <attribute name='z' type='coordinate' units='m'/>
            <attribute name='dx' type='distance' units='m'/>
            <attribute name='dy' type='distance' units='m'/>
            <attribute name='dz' type='distance' units='m'/>
            <ref id='triangles'/>
          </element>
        </selection>
      </repetition>
    </element>
  </element>
</schema>

Figure 3: BSML schemas for an octree decomposition of an environment, in XML notation. 'tr' stands for a triangle,
'v' stands for a vertex, 'oi' stands for an internal node, and 'ol' stands for a leaf.
type(distance, double, $, $, true, true, $)
type(coordinate, double, $, $, true, true, $)

schema(triangles,
  repetition($, $, $, $,
    element($, $, tr,
      repetition($, $, 3, 3,
        element($, $, v,
          attribute($, x, data(coordinate, $, $, $, $, m)),
          attribute($, y, data(coordinate, $, $, $, $, m)),
          attribute($, z, data(coordinate, $, $, $, $, m))
        )
      )
    )
  )
)

schema(octree,
  element($, $, octree,
    element(oi, $, oi,
      attribute($, x, data(coordinate, $, $, $, $, m)),
      attribute($, y, data(coordinate, $, $, $, $, m)),
      attribute($, z, data(coordinate, $, $, $, $, m)),
      attribute($, dx, data(coordinate, $, $, $, $, m)),
      attribute($, dy, data(coordinate, $, $, $, $, m)),
      attribute($, dz, data(coordinate, $, $, $, $, m)),
      ref(triangles),
      repetition($, $, $, $,
        selection($, $,
          ref(oi),
          element($, $, ol,
            attribute($, x, data(coordinate, $, $, $, $, m)),
            attribute($, y, data(coordinate, $, $, $, $, m)),
            attribute($, z, data(coordinate, $, $, $, $, m)),
            attribute($, dx, data(coordinate, $, $, $, $, m)),
            attribute($, dy, data(coordinate, $, $, $, $, m)),
            attribute($, dz, data(coordinate, $, $, $, $, m)),
            ref(triangles)
          )
        )
      )
    )
  )
)

Figure 4: BSML schemas from Figure 3 in a non-XML notation. $ stands for a missing value, i.e., a suitable default
value is supplied by BSML software.
S   → s(octree), s(oi), T, C, e(oi), e(octree)
T   → ε
T   → {B_t}, s(tr), {B_v}, s(v), e(v), {A_v}, V, {E_v}, e(tr), {A_t}, T', {E_t}
T'  → ε
T'  → s(tr), s(v), e(v), {A_v}, V, {E_v}, e(tr), {A_t}, T'
V   → ε
V   → s(v), e(v), {A_v}, V
C   → …
C   → …
C'  → s(oi), T, C, e(oi)
C'  → s(ol), T, e(ol)
C'' → ε
C'' → I
I   → s(oi), T, I'
I   → s(ol), T, e(ol), {A}, C''
I'  → C', {A}, C'', {E}, e(oi), {A}, C''
I'  → e(oi), {A}, C''

Figure 5: LL(1) grammar corresponding to the octree schemas in Figures 3 and 4. Attributes are omitted for
simplicity. Patterns of the form {c} will be explained in the next section (they are related to repetitions).
Non-terminals T, T', and V are related to triangles; others are related to octree decomposition of a set of triangles.
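A grammar of this kind can drive a table-driven predictive parser that consumes one terminal at a time. The sketch below covers only the triangle non-terminals, with binding codes removed (so T and T' coincide) and under the assumption that the triangle list is the entire input. The parsing table is written by hand rather than generated, and the string spelling of terminals is our choice. Because the binding codes are omitted, the sketch does not enforce the count of exactly three vertices per triangle.

```python
END = "$"  # end-of-input marker

# Hand-written predictive parsing table: (non-terminal, lookahead) -> production body.
TABLE = {
    ("T", "s(tr)"): ["s(tr)", "s(v)", "e(v)", "V", "e(tr)", "T"],
    ("T", END): [],
    ("V", "s(v)"): ["s(v)", "e(v)", "V"],
    ("V", "e(tr)"): [],
}
NONTERMINALS = {"T", "V"}


def validate(tokens, start="T"):
    """Return True if the token stream is derivable from `start`.

    Tokens are pulled from an iterator one at a time, so the input
    never has to be buffered as a whole.
    """
    stack = [END, start]
    stream = iter(tokens)
    lookahead = next(stream, END)
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            body = TABLE.get((top, lookahead))
            if body is None:
                return False  # no production applies: validation error
            stack.extend(reversed(body))
        elif top == lookahead:
            if top == END:
                return True  # stack and input exhausted together
            lookahead = next(stream, END)
        else:
            return False  # terminal mismatch
    return False


triangle = ["s(tr)"] + ["s(v)", "e(v)"] * 3 + ["e(tr)"]
print(validate(triangle * 2))
```

The parser decides which production to apply from a single token of lookahead, which is what makes validation compatible with stream processing.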



element(id, opt, name, B_1, B_2, ..., B_n)           E  → s(name), X_1, X_2, ..., X_n, e(name)
                                                     E  → ε  if opt
sequence(id, opt, B_1, B_2, ..., B_n)                Q  → X_1, X_2, ..., X_n
                                                     Q  → ε  if opt
selection(id, opt, B_1, B_2, ..., B_n)               L  → X_1
                                                     L  → X_2
                                                     ...
                                                     L  → X_n
                                                     L  → ε  if opt
repetition(id, opt, min, max, B_1, B_2, ..., B_n)    R  → {B}, X_1, X_2, ..., X_n, {A}, R', {E}
                                                     R' → X_1, X_2, ..., X_n, {A}, R'
                                                     R' → ε
                                                     R  → ε  if opt or min = 0
data(base, min, max, number, finite, units)          D  → d(base, min, max, number, finite, units)
code(c)                                              C  → {c}

Figure 6: L-attributed definition of BSML. Schema primitives, in a non-XML notation, are on the left (see Figure 4
for an example) and their translations to grammar productions are on the right. B_1, B_2, ..., B_n are the children of
the schema block and X_1, X_2, ..., X_n are the root non-terminals generated for B_1, B_2, ..., B_n, respectively. opt is
a boolean block attribute; true means that the block is optional. {B}, {A}, {E}, and {c} are binding codes explained
in the next section. References to schema blocks (denoted by ref(id)) are replaced with root non-terminals of the
blocks being referenced. Definitions related to XML attributes are omitted.
The purpose of miscellaneous cleanup is to reduce the number of non-terminals in the grammar. These ad-hoc
rewritings do not guarantee that the resultant grammar is minimal in any strict sense. Instead, they address some
inefficiencies that other steps are likely to introduce. These cleanup steps were also chosen such that if the grammar
were LL(1) before cleanup, it would remain LL(1) after cleanup. The grammars shown in this paper have undergone
two cleanup rewritings. Each rewriting is applied until no further rewriting is possible.

1. Maximum length common suffixes are factored out. $\beta \ne \epsilon$ is the maximum length common suffix of a
non-terminal $A \ne S$ if (a) all of $A$'s productions have the form $A \to \alpha_i \beta$, $1 \le i \le n$, (b) $\beta$ is of
maximum length, and (c) neither $\beta$ nor any $\alpha_i$ contains $A$. If $n = 1$, $A$ is eliminated from the grammar and all
occurrences of $A$ in the grammar are replaced with $\beta$ ($\alpha_1 = \epsilon$ because $\beta$ is of maximum length). We call such
non-terminals trivial. Trivial non-terminals are often introduced by schema-to-grammar conversion rules. If $n > 1$,
all occurrences of $A$ on the right-hand sides of all grammar productions are replaced with $A\beta$ and the suffix $\beta$ is
deleted from all of $A$'s productions. The purpose of this rewriting is to uncover duplicate non-terminals for
the next step.

2. Only one of any two duplicate non-terminals is retained. Two non-terminals $A \ne B$ are duplicate if whenever
$A \to \alpha$ is in the grammar, $B \to \alpha$ is also in the grammar, and vice versa. $A$ is eliminated if $A \ne S$; $B$ is
eliminated otherwise. This definition is weak, e.g., $A$ and $B$ are not considered duplicate if $A \to \alpha A \beta$ and
$B \to \alpha B \beta$ are in the grammar. However, it suffices for our purposes.
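The two cleanup rewritings can be illustrated with a minimal Python sketch. It is not the paper's implementation: the grammar representation (a dict mapping each non-terminal to a list of right-hand-side tuples) and the function names are assumptions made for illustration, and only the trivial case ($n = 1$) of suffix factoring is shown.

```python
def eliminate_trivial(grammar, start):
    """Inline each non-start non-terminal A with a single production A -> beta,
    where beta is non-empty and does not mention A (the n = 1 case)."""
    grammar = {a: [tuple(p) for p in prods] for a, prods in grammar.items()}
    changed = True
    while changed:
        changed = False
        for a, prods in list(grammar.items()):
            if a == start or len(prods) != 1 or not prods[0] or a in prods[0]:
                continue
            beta = prods[0]
            del grammar[a]
            for b in grammar:
                # Replace every occurrence of A with the symbols of beta.
                grammar[b] = [
                    tuple(s for sym in p for s in (beta if sym == a else (sym,)))
                    for p in grammar[b]
                ]
            changed = True
            break
    return grammar


def merge_duplicates(grammar, start):
    """Keep one of any two non-terminals with identical production sets.
    The start symbol is never the one that is dropped."""
    grammar = {a: [tuple(p) for p in prods] for a, prods in grammar.items()}
    changed = True
    while changed:
        changed = False
        names = sorted(grammar)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if set(grammar[a]) == set(grammar[b]):
                    drop, keep = (a, b) if a != start else (b, a)
                    del grammar[drop]
                    for n in grammar:
                        renamed = [tuple(keep if s == drop else s for s in p)
                                   for p in grammar[n]]
                        grammar[n] = list(dict.fromkeys(renamed))
                    changed = True
                    break
            if changed:
                break
    return grammar
```

As the text notes, the duplicate test is deliberately weak: two self-recursive non-terminals with otherwise identical productions have different production sets and are left alone by `merge_duplicates`.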

The expressive power of LL(1) grammars is well known. In practice, the limiting factor is not that the grammar 
is LL(1), but that the grammar is annotated with user codes. The next section gives two examples of grammars that 
are not convertible to LL(1) because binding codes are present. A more interesting question is how the expressive 
power of LL(1) grammars compares to the expressive power of BSML. It is easy to see that BSML can express 
only a proper subset of LL(1) grammars. For example, $S \to s(x), e(y)$ is a valid LL(1) grammar, but BSML cannot
express it since no XML document that conforms to this grammar is well-formed. 

Observation 1. Consider a subset of BSML that excludes repetitions and user codes. We say that BSML can 
express a grammar G if a predictive parser generated from some schema in this restricted subset of BSML can 
recognize precisely the language L(G). Clearly, BSML cannot express any grammar G that is not LL(1) (by con- 
struction of the predictive parser). Further, BSML cannot express an LL(1) grammar G unless: 

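1. if $d_1$ and $d_2$ are data terminals in $G$, then $\forall \alpha, \beta : S \not\Rightarrow^{+} \alpha, d_1, d_2, \beta$ (data is atomic),

2. if $d$ is a data terminal and $S \Rightarrow^{+} \alpha, d, \beta$ is a derivation in $G$, then
$\forall x, \gamma : ([\beta \not\Rightarrow^{*} s(x), \gamma]$ and $[(\beta \Rightarrow^{*} e(x), \gamma)$ implies $(\forall y, \theta : \alpha \not\Rightarrow^{*} \theta, e(y))])$
(no mixed contents), and

3. if $s(x)$ is a start of element terminal, $g$ is $\epsilon$ or a data terminal, and $S \Rightarrow^{+} \alpha, s(x), \beta$ is a derivation in $G$, then
$([\beta \not\Rightarrow^{*} g]$ and $[(y \ne x)$ implies $(\forall \gamma : \beta \not\Rightarrow^{*} g, e(y), \gamma)])$; similarly, if $e(x)$ is an end of element
terminal and $S \Rightarrow^{+} \alpha, e(x), \beta$ is a derivation in $G$, then $([\alpha \not\Rightarrow^{*} g]$ and $[(x \ne y)$ implies
$(\forall \theta : \alpha \not\Rightarrow^{*} \theta, s(y), g)])$ (proper nesting of elements). □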

The first two restrictions are specific to BSML and easy to relax. However, the last restriction is inherent in any 
XML schema language. A good schema language cannot describe documents that are not well-formed. These are 
the necessary conditions, but it is not clear whether or not they are sufficient. We define schemas in terms of the 
schema language, not in terms of LL(1) grammars, so converting from grammars to schemas is not considered in 
this paper. 
This section provided an overview of BSML features and defined BSML in terms of an 'almost context-free'
grammar. We outlined automatic generation of predictive parsers that validate XML documents. Further, we have 
shown that the descriptive power of BSML is strictly less than that of an LL(1) grammar where the terminals are 
SAX events. The next section extends validation to perform binding. 



4 Binding 

Binding is a way to integrate semistructured data with languages that were not designed to handle it (requirement 1). 
Binding can take several forms, depending on the language. For FORTRAN and C, binding usually means assigning 
values to language variables and calling user-defined code to process these values (procedural binding). It can also 
mean writing the data out in a format understood by the component (format conversion). For Matlab and SQL, 
binding entails generating a script that contains embedded data and processing this script by an interpreter (code 
generation). The last two kinds of binding can be thought of as XSLT-like transformations. 

We implement all three kinds of binding by L-attributed definitions. The schema language is extended by 
allowing user code to be injected in the schema. Schema languages that provide binding are called binding schema 
markup languages. This section describes bindings in BSML and gives an example of their use. Further, we show 
how arbitrary binding codes limit the set of schemas supported by BSML. 

Let $c$ denote an arbitrary string of code. Matching $\{c\}$ means executing code $c$ while consuming no input tokens.
No assumptions are made about the nature of $c$. In particular, $c$ can (and usually does) produce side effects, so
$A \to \{c_1\}, \{c_2\}$ and $A \to \{c_2\}, \{c_1\}$ can yield different results. A syntax-directed definition is a context-free
grammar extended by allowing $\{c_i\}$ on the right-hand sides of productions. For a syntax-directed definition to be
useful in binding, $c_i$ must contain references to parts of the document being parsed. We denote such references by
%x, where x is the id or the name of some element or attribute. When x refers to an attribute or an element of some
primitive type, %x is the value of the attribute or the data contents of the element. The type of %x is determined by the
corresponding primitive type. When x refers to an element of a wildcard type, %x is a DOM tree constructed from
all descendants of x, including x itself. This feature can be used for XHTML [21] documentation. The set of attributes
(elements) that are available to code $c$ depends on the placement of $c$ in the syntax-directed definition and the parsing
strategy. A syntax-directed definition is L-attributed if, for any derivation $S \Rightarrow^{*} \alpha \{c\} \beta$, any x referenced in $c$ is
defined in all derivations of $\alpha$. That is, all attributes (elements) must be defined in a left-to-right scan before they are
referenced. L-attributed definitions are easy to implement with an LL(1) parser, but they restrict the set of grammars
reducible to LL(1). Luckily, these restrictions are not important in practice.
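To make the %x convention concrete, here is a small Python sketch of reference expansion. It is an illustration only: the function name `expand_refs` and the dictionary of already-parsed values are assumptions, not part of BSML. Raising an error for an unbound name mirrors the L-attributed requirement that every reference be defined before the code that uses it is reached.

```python
import re


def expand_refs(code, bindings):
    """Replace each %name in a binding code with its already-parsed value."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            # In an L-attributed definition this must never happen:
            # every reference is defined to the left of the code.
            raise KeyError(f"%{name} referenced before it was defined")
        return str(bindings[name])

    return re.sub(r"%([A-Za-z_][A-Za-z0-9_]*)", substitute, code)


# Hypothetical values bound while parsing one <ray> element.
print(expand_refs("%time %power", {"time": -4, "power": -88.0937}))
```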

Figure [7] gives an example binding schema for a PDP (see Section and Figure [8] shows how a parser generated 
from this schema converts a PDP encoded in XML to a Matlab script. This script will then be executed by an 
execution manager (see Section ||). The same schema, with different binding code, can convert an XML file to 
a number of SQL INSERT statements that record the data in a relational database. The semantics of user codes 
are not limited to printing, so a FORTRAN version of this binding can store the PDP in an array to be processed 
later. In other words, BSML bindings are compatible with any execution environment that processes streams of data 
(requirement 7). We use the same approach to convert semistructured data to relational data, Matlab scripts, and C 
structures. 
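The following Python sketch imitates, by hand, what a parser generated from the PDP binding schema does: it consumes SAX events as a stream and prints one Matlab row per ray. It is a hand-written stand-in under stated assumptions, not generated BSML output; the class and function names are hypothetical, and no validation is performed.

```python
import xml.sax


class PdpToMatlab(xml.sax.ContentHandler):
    """Streams a PDP document and collects the lines of a Matlab matrix."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []
        self._ray = {}

    def startElement(self, name, attrs):
        self._text = []
        if name == "pdp":
            # The schema places this code after the optional statistics;
            # they produce no output, so emitting it here gives the same text.
            self.lines.append("M= [")
        elif name == "ray":
            self._ray = {}

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name in ("time", "power"):
            self._ray[name] = "".join(self._text).strip()
        elif name == "ray":
            # Corresponds to the binding code "%time %power".
            self.lines.append(f"{self._ray['time']} {self._ray['power']}")
        elif name == "pdp":
            self.lines.append("];")


def pdp_to_matlab(xml_text):
    handler = PdpToMatlab()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "\n".join(handler.lines)
```

Swapping the three `append` calls for SQL INSERT strings or array stores would give the other bindings mentioned above without changing the traversal.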

The {B}, {A}, and {E} codes in Figure |7] are generated for repetitions. They are not necessary for this example, 
but are required to enforce that each triangle has three vertices in the previous example. {B} (begin repetition) 
initializes the repetition count to zero. Each repetition has its own stack of counts. {A} (append) ensures that the 
maximum allowed number of repetitions is not exceeded. {E} (end) checks the minimum number of repetitions. 
Thus, even simple validation (without binding) is implemented in terms of an L-attributed definition, not just an 
LL(1) grammar. 
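A minimal sketch of the three repetition codes, assuming a per-repetition stack of counts as described above. The class name and the use of exceptions to signal validation failure are illustrative choices, not the paper's implementation.

```python
class RepetitionCounter:
    """Bookkeeping behind the generated {B}, {A}, and {E} codes of one repetition."""

    def __init__(self, min_count=0, max_count=None):
        self.min_count = min_count
        self.max_count = max_count
        self._stack = []  # one count per active (possibly nested) instance

    def begin(self):
        """{B}: start a new instance with a count of zero."""
        self._stack.append(0)

    def append(self):
        """{A}: count one more item and enforce the maximum."""
        self._stack[-1] += 1
        if self.max_count is not None and self._stack[-1] > self.max_count:
            raise ValueError("too many repetitions")

    def end(self):
        """{E}: finish the instance and enforce the minimum."""
        count = self._stack.pop()
        if count < self.min_count:
            raise ValueError("too few repetitions")
        return count


# A triangle must have exactly three vertices.
vertices = RepetitionCounter(min_count=3, max_count=3)
vertices.begin()
for _ in range(3):
    vertices.append()
print(vertices.end())
```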

Unfortunately, L-attributed definitions make predictive parsing of certain grammars impossible. User codes can 
<element name='pdp'>
  <element name='rds' optional='true' type='time' units='ns'/>
  <element name='med' optional='true' type='time' units='ns'/>
  <element name='pp' optional='true' type='power' units='dBW'/>
  <code>M= [</code>
  <repetition>
    <element name='ray'>
      <element name='time' type='time' units='ns'/>
      <element name='power' type='power' units='dBW'/>
    </element>
    <code>%time %power</code>
  </repetition>
  <code>];</code>
</element>

{%time %power}, {A}, X

Figure 7: (top) Binding schema for a power delay profile. rds, med, and pp stand for various optional statistics: rms
delay spread, mean excess delay, and peak power. These statistics are ignored in this example. (left) L-attributed
definition for a power delay profile. {B}, {A}, and {E} stand for codes generated by the parser generator to handle
repetitions. Otherwise, the meaning of {c} is to print string c, followed by a new line character, after expanding
element references. For clarity, full suffix factoring was not performed, but trivial productions were eliminated.
(right) Predictive parsing table for a power delay profile.






<pdp>                                                   M= [
  <rds>23.0998</rds>
  <med>20.5691</med>
  <pp>-75.5665</pp>
  <ray><time>-4</time><power>-88.0937</power></ray>     -4 -88.0937
  <ray><time>-3</time><power>-82.4416</power></ray>     -3 -82.4416
  <ray><time>-2</time><power>-78.5346</power></ray>     -2 -78.5346
  <ray><time>-1</time><power>-76.2634</power></ray>     -1 -76.2634
  <ray><time>0</time><power>-75.5665</power></ray>      0 -75.5665
  <ray><time>1</time><power>-76.4908</power></ray>      1 -76.4908
  <ray><time>2</time><power>-79.2101</power></ray>      2 -79.2101
  <ray><time>3</time><power>-84.0673</power></ray>      3 -84.0673
  <ray><time>24</time><power>-86.4976</power></ray>     24 -86.4976
  <ray><time>25</time><power>-84.3451</power></ray>     25 -84.3451
  <ray><time>26</time><power>-84.3173</power></ray>     26 -84.3173
  <ray><time>27</time><power>-85.963</power></ray>      27 -85.963
  <ray><time>28</time><power>-87.7374</power></ray>     28 -87.7374
  <ray><time>29</time><power>-88.6525</power></ray>     29 -88.6525
  <ray><time>43</time><power>-89.2007</power></ray>     43 -89.2007
  <ray><time>44</time><power>-83.17</power></ray>       44 -83.17
  <ray><time>45</time><power>-79.2179</power></ray>     45 -79.2179
  <ray><time>46</time><power>-77.3306</power></ray>     46 -77.3306
  <ray><time>47</time><power>-77.4917</power></ray>     47 -77.4917
  <ray><time>48</time><power>-79.645</power></ray>      48 -79.645
  <ray><time>49</time><power>-83.6205</power></ray>     49 -83.6205
  <ray><time>50</time><power>-88.7676</power></ray>     50 -88.7676
</pdp>                                                  ];

Figure 8: (left) An example PDP in XML. The data corresponds to a simulated channel in the corridor of the fourth
floor of Durham Hall, Virginia Tech. The post processor samples the channel at 1 ns time intervals to match the
output of a channel sounder. (right) Matlab encoding of the PDP on the left, output by the parser generated from the
schema in Figure 7.
prevent elimination of left recursion or left factoring of an L-attributed definition. In the two examples below,
grammars induced from the L-attributed definitions by removing all user code can be transformed to LL(1). However,
the original L-attributed definitions cannot be transformed to LL(1) without losing the stream semantics of the parser.

Example 1. Consider a left-recursive schema and the corresponding left-recursive grammar (after eliminating
trivial non-terminals):

<selection id='s'>
  <sequence>
    <!-- empty -->
  </sequence>
  <sequence>
    <code>c</code> <ref id='s'/>
    <element name='x'> <code>b</code> </element>
  </sequence>
</selection>

S -> ε
S -> {c}, S, s(x), {b}, e(x)

This grammar permits a derivation of the form $S \Rightarrow^{*} \{c\}^k, (s(x), \{b\}, e(x))^k$, $k \ge 0$. However, code $b$ cannot
be executed before $k$ is known since $k$ executions of code $c$ must precede the first execution of code $b$. Therefore, no
LL(1) parser with stream semantics can parse documents that conform to this schema. On the other hand, removing
{c} from the L-attributed definition yields a grammar (left-recursive form first) that is easily converted to LL(1)
(right-recursive form second):

S -> ε
S -> S, s(x), {b}, e(x)

S -> ε
S -> s(x), {b}, e(x), S

This example is easy to generalize. □

Observation 2. Consider the set of all productions for a non-terminal $A$. Since any sequence $\{c_1\}\{c_2\}$ can be
rewritten as $\{c\}$, where $c = c_1 c_2$, we can uniquely represent this set by a single production

$$A \to \{c_1\} A \alpha_1 \mid \{c_2\} A \alpha_2 \mid \cdots \mid \{c_n\} A \alpha_n \mid \beta_1 \mid \beta_2 \mid \cdots \mid \beta_m,$$

where no $\beta_j$, $1 \le j \le m$, has a prefix $\{d\}A$. Immediate left recursion can be eliminated from this production
without delaying user code execution if and only if

1. $c_1 = c_2 = \cdots = c_n = \epsilon$ (no user code to the left) or

2. $[(\beta_j \Rightarrow^{*} \gamma\{d\}\theta,\ 1 \le j \le m)$ or $(\alpha_i \Rightarrow^{*} \gamma\{d\}\theta,\ 1 \le i \le n)]$ implies $(d = \epsilon)$ (no user code to the right)
and $(c_1 = c_2 = \cdots = c_n)$ (same user code to the left).

In all other cases, execution of user code must be delayed until the last $\alpha_i$ is matched. □

Consider a derivation of $A$ that is no longer left-recursive (i.e., does not have a prefix of $\{d\}A$). All such
derivations can be written as

$$A \Rightarrow^{*} \{c_{i_1}\}, \{c_{i_2}\}, \ldots, \{c_{i_k}\}, \beta_j, \alpha_{i_k}, \ldots, \alpha_{i_2}, \alpha_{i_1},$$

where $\beta_j$, $1 \le j \le m$, stops left recursion after (at least) $k + 1$ steps and $1 \le i_1, i_2, \ldots, i_k \le n$ represent the
choices for $\alpha_i$ in the derivation. Suppose $\beta_j \Rightarrow^{*} \gamma\{d\}\theta$ or $\alpha_{i_r} \Rightarrow^{*} \gamma\{d\}\theta$. The sequence of codes
$c_{i_1}, c_{i_2}, \ldots, c_{i_k}$ must be executed before code $d$, but the LL(1) parser will only determine this sequence after it
has parsed all of $\beta_j, \alpha_{i_k}, \ldots, \alpha_{i_2}, \alpha_{i_1}$. Thus, eliminating left recursion entails delaying user code execution in
all but the trivial cases mentioned above.
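The ordering problem in Example 1 can be seen by expanding the L-attributed definition mechanically. The Python sketch below is illustrative only (the function names are assumptions); it lists codes and SAX events in the order required by $k$ applications of the recursive production followed by the empty one.

```python
def trace(k):
    """Events required by applying S -> {c}, S, s(x), {b}, e(x) k times,
    then S -> ε, read left to right."""
    if k == 0:
        return []
    return ["{c}"] + trace(k - 1) + ["s(x)", "{b}", "e(x)"]


def codes_before_first_token(k):
    """Number of user codes that must run before any input token is consumed."""
    events = trace(k)
    for position, event in enumerate(events):
        if not event.startswith("{"):
            return position
    return len(events)


for k in range(4):
    print(k, codes_before_first_token(k))
```

The count of codes that must run before the first token equals the number of x elements in the document, which a streaming parser only learns at the end of the input.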
Example 2. Left factoring of L-attributed definitions poses similar problems. Consider the following schema and
L-attributed definition (a more realistic version of this example would have a repetition in place of the x element):

<selection>
  <sequence>
    <code>c</code>
    <element name='x'/><element name='y'/>
  </sequence>
  <sequence>
    <code>d</code>
    <element name='x'/><element name='z'/>
  </sequence>
</selection>

S -> {c}, s(x), e(x), s(y), e(y)
S -> {d}, s(x), e(x), s(z), e(z)

The decision about whether to execute code c or d cannot be made until s(y) or s(z) is processed. However, removing
user codes makes this L-attributed definition easy to refactor. Again, we can show a more general condition. □

Observation 3. Consider the set of all productions for a non-terminal $A$ written as

$$A \to \alpha_1 \beta_1 \mid \alpha_2 \beta_2 \mid \cdots \mid \alpha_n \beta_n \mid \gamma_1 \mid \gamma_2 \mid \cdots \mid \gamma_m,$$

such that $\alpha'_1 = \alpha'_2 = \cdots = \alpha'_n = \alpha \ne \epsilon$ ($\alpha'$ denotes $\alpha$ with all user code removed) and $\alpha$ is not a prefix of any
$\gamma_1, \gamma_2, \ldots, \gamma_m$. Let the length of $\alpha$ be maximum and the lengths of $\alpha_i$, $1 \le i \le n$, be minimum subject to $n \ge 2$, in
which case this representation of $A$ is unique. $A$ can be left-factored without delaying execution of user code if and
only if

1. no rewriting of $A$ in the above form exists (no two definitions of $A$ share the same prefix, less user codes), or

2. $\alpha_1 = \alpha_2 = \cdots = \alpha_n$ (same codes to the left) and $A \to \gamma_1 \mid \gamma_2 \mid \cdots \mid \gamma_m$ can be left-factored. □

To summarize, we implement bindings in terms of L-attributed definitions from parsing theory. These bindings
work well in practice, but, in theory, annotating a schema that can be rewritten in LL(1) form can make it no longer
rewritable in LL(1) form. This difficulty is inherent in L-attributed definitions. We currently assume that the user
is responsible for resolving such conflicts. In practice, schemas for PSE data rarely require complicated grammars.
Repetitions take care of most of the recursive schema definitions. To make LL(1) parsing possible, troublesome
content can be simply enclosed in an extra XML element, whose start and end tags disambiguate the transitions of
the LL(1) parser.

5 Conversion

Conversion is the cornerstone of a system's ability to handle changes and interface mismatches. Conversion in a
PSE helps to retain historical data and facilitates inclusion of new components. We use change detection principles
from [pH, with a few important differences. First, our goal is not merely to detect changes, but to make PSE
components work despite the changes. Second, we detect changes in the schema, not in the data. The PSE environment
must guarantee that the data is in the right format for the component. The job of the component is to process any data
instance that conforms to the right format. Last, change detection and conversion are local to the extent possible.
Locality is a virtue not only because it allows for stream processing, but also because it limits sporadic conversions
between unrelated entities.

Similarly to the two previous sections, this section starts with a comprehensive example. Then, we describe
the core of the conversion algorithm and outline its limitations. Finally, we extend the initial algorithm to handle
content replacements: unit conversion and user-defined conversion filters. At this point, it should not come as a
surprise to the reader that most of the technical limitations of conversion are due to binding codes, not to the nature
of the schema language. Therefore, the tedious details of handling binding codes are omitted. The emphasis is on
non-technical limitations. What forms of semantic conversions can be 'syntactized' in a schema language? When
does such 'syntactization' backfire and produce undesired outcomes?

The functional statement of the conversion problem can be given as follows. Given the actual schema $S_a$ and
the required schema $S_r$, replace binding codes in $S_a$ with binding codes in $S_r$ and conversion codes to obtain the
conversion schema $S_c$. $S_c$ must describe precisely the documents described by $S_a$, but perform the same bindings
as $S_r$.



Example 3. Figure 9 depicts two slightly different schemas for antenna descriptions in S$^4$W. The schema at the
bottom (actual schema) was our first attempt at defining a data format for antenna descriptions. This version
supported only one antenna type and exhibited several inadequate representation choices. E.g., polar coordinates should
have been used instead of Cartesian coordinates because antenna designers prefer to work in the polar coordinate
system. Antenna gain was not considered in the first version because its effect is the same as changing transmitter
power. However, this seemingly unnecessary parameter should have been included because it results in a more direct
correspondence of simulation input to a physical system.

The schema at the top of Figure 9 (required schema) improves upon the actual schema in several ways. It better
adheres to common practices and supports more antenna types. However, this schema is different from the actual
schema, while compatibility with old data needs to be retained (requirement 2). Figure 10 illustrates how addition
of conversion and binding codes to the actual schema solves the compatibility problem. A parser generated from the
conversion schema in Figure 10 will recognize the actual data and provide the required binding. □
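The two data conversions this example calls for (Cartesian to polar orientation, inches to millimeters) mirror the Tcl conversion codes of Figure 10. A Python sketch of the same arithmetic follows; the function names are illustrative, the angles are returned in radians (an assumption, since the schema does not state angle units), and the zero-vector check is an addition not present in the figure.

```python
import math

MM_PER_INCH = 25.4


def cartesian_to_polar(x, y, z):
    """Return (phi, theta) for the direction (x, y, z), in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0:
        raise ValueError("orientation vector must be non-zero")
    phi = math.atan2(y, x)      # azimuth
    theta = math.acos(z / r)    # angle from the z axis
    return phi, theta


def inches_to_mm(value):
    """Convert a distance from inches to millimeters."""
    return MM_PER_INCH * value


phi, theta = cartesian_to_polar(1.0, 0.0, 0.0)
print(phi, theta, inches_to_mm(2.0))
```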



Following [11], the basic assumption of the conversion algorithm is that the actual schema $S_a$ can be converted to
the required schema $S_r$ by some sequence of 'standard' edits. This sequence of edits is called an edit script. Once
the possible types of edits are defined (what we can call a 'conversion library'), the job of the conversion algorithm
is to (a) find an edit script that transforms the actual schema $S_a$ to the required schema $S_r$ and (b) express this
edit script as data transformations, not schema transformations. In other words, the conversion algorithm looks for a
systematic procedure that converts actual data instances that conform to $S_a$ to the required format $S_r$. This procedure
is expressed as a conversion schema $S_c$ that has the structure of $S_a$, but binding codes from $S_r$ and the conversion
library. $S_c$ is then used to generate a parser that parses data instances conforming to $S_a$ and acts as if it parsed data
instances conforming to $S_r$.

Our conversion algorithm supports four kinds of schema edits: 

1. generalization, 

2. restriction, 

3. reordering, and 

4. replacement. 

We use these terms in reference to the required schema, e.g., 'the required schema is a generalization of the actual
schema.' Generalization and restriction of schema trees are similar to insertions and deletions in sequence alignment
problems. Reordering and replacement mostly retain their standard meaning, except we consider replacements of
sets of schema blocks, not individual schema blocks. We first reduce the problem of converting trees to an easier
problem of converting sequences (see Figure 11). Sequence conversion (rule Q) in this initial formulation performs
all conversions but replacements. Then, we slightly restrict this definition to make it practical and generalize rule Q
to accommodate replacements (unit conversion and user-defined conversion filters).

The conversion algorithm revolves around the 'determines' relation between schemas. Intuitively, an actual
schema $S_a$ should determine a required schema $S_r$ if any document that conforms to $S_a$ contains sufficient
information to construct an 'appropriate' document that conforms to $S_r$. 'Appropriate' here is obviously a domain-specific
<element name='antennas'>
  <repetition>
    <element name='antenna'>
      <element name='id' type='string' min='1'/>
      <element name='phi' type='angle'/>
      <element name='theta' type='angle'/>
      <element name='gain' type='ratio' units='dB' optional='true' default=' '/>
      <code>puts stdout "%id: %phi %theta %gain"</code>
      <selection>
        <element name='waveguide'>
          <element name='width' type='distance' units='mm'/>
          <element name='height' type='distance' units='mm'/>
          <code>puts stdout "waveguide: %width %height"</code>
        </element>
        <element name='pyramidal_horn'>
          <element name='width' type='distance' units='mm'/>
          <element name='rw' type='distance' units='mm'/>
          <element name='height' type='distance' units='mm'/>
          <element name='rh' type='distance' units='mm'/>
          <code>puts stdout "pyramidal horn: %width %rw %height %rh"</code>
        </element>
      </selection>
    </element>
  </repetition>
</element>

<element name='antennas'>
  <repetition>
    <element name='antenna'>
      <element name='id' type='string' min='1'/>
      <element name='description' type='*'/>
      <element name='x' type='coordinate'/>
      <element name='y' type='coordinate'/>
      <element name='z' type='coordinate'/>
      <element name='waveguide'>
        <element name='width' type='distance' units='in'/>
        <element name='height' type='distance' units='in'/>
      </element>
    </element>
  </repetition>
</element>

Figure 9: Two slightly different schemas for a collection of antennas. The component requires the top schema,
but the data conforms to the bottom schema. The bottom schema (a) represents antenna orientation in Cartesian
coordinates, not polar coordinates, (b) lacks antenna gain, (c) requires antenna descriptions, (d) measures antenna
dimensions in inches, not millimeters, and (e) covers only one antenna type. The schema at the bottom does not
contain binding codes because they are irrelevant for this example. All binding codes are in Tcl.



<element name='antennas'>
  <repetition>
    <element name='antenna'>
      <element name='id' type='string' min='1'/>
      <element name='description' type='*'/>
      <element name='x' type='coordinate'/>
      <element name='y' type='coordinate'/>
      <element name='z' type='coordinate'/>
      <code> <!-- convert coordinates from rectangular to polar -->
        set _r [expr sqrt(%x*%x+%y*%y+%z*%z)]
        set %phi [expr atan2(%y,%x)]
        set %theta [expr acos(%z/$_r)]
      </code>
      <code> <!-- set default gain -->
        set %gain
      </code>
      <code>puts stdout "%id: %phi %theta %gain"</code>
      <element name='waveguide'>
        <element name='width' type='distance' units='mm'/>
        <code> <!-- convert units from inches to millimeters -->
          set %width [expr 25.4*%width]
        </code>
        <element name='height' type='distance' units='mm'/>
        <code> <!-- convert units from inches to millimeters -->
          set %height [expr 25.4*%height]
        </code>
        <code>puts stdout "waveguide: %width %height"</code>
      </element>
    </element>
  </repetition>
</element>

Figure 10: Actual schema from Figure 9 (bottom) after inserting conversion and binding codes. This schema
describes the actual documents, but provides the bindings of the required schema (top of Figure 9). We use _r instead
of %r because the latter could interfere with another use of the name r.



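$D_r$: $\mathrm{data}(\mathit{base}_a, \mathit{min}_a, \mathit{max}_a, \mathit{number}_a, \mathit{finite}_a, \mathit{units}_a) \succeq \mathrm{data}(\mathit{base}_r, \mathit{min}_r, \mathit{max}_r, \mathit{number}_r, \mathit{finite}_r, \mathit{units}_r)$
    if $\mathit{base}_a = \mathit{base}_r$, $\mathit{min}_a \ge \mathit{min}_r$, $\mathit{max}_a \le \mathit{max}_r$, $\mathit{number}_r \Rightarrow \mathit{number}_a$, $\mathit{finite}_r \Rightarrow \mathit{finite}_a$, $\mathit{units}_a = \mathit{units}_r$

$E$: $\mathrm{element}(\mathit{id}_a, \mathit{opt}_a, \mathit{name}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq \mathrm{element}(\mathit{id}_r, \mathit{opt}_r, \mathit{name}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{name}_a = \mathit{name}_r$, $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$E_g$: $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq \mathrm{element}(\mathit{id}_r, \mathit{opt}_r, \mathit{name}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(X_a(\mathit{id}_a, \mathit{opt}_a, \ldots)) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$E_r$: $\mathrm{element}(\mathit{id}_a, \mathit{opt}_a, \mathit{name}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$

$P$: $\mathrm{sequence}(\mathit{id}_a, \mathit{opt}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq \mathrm{sequence}(\mathit{id}_r, \mathit{opt}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$P_g$: $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq \mathrm{sequence}(\mathit{id}_r, \mathit{opt}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(X_a(\mathit{id}_a, \mathit{opt}_a, \ldots)) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$P_r$: $\mathrm{sequence}(\mathit{id}_a, \mathit{opt}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$

$C$: $\mathrm{selection}(\mathit{id}_a, \mathit{opt}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq \mathrm{selection}(\mathit{id}_r, \mathit{opt}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $\forall C_{ai} : (\exists! C_{rj} : C_{ai} \succeq C_{rj})$

$C_g$: $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq \mathrm{selection}(\mathit{id}_r, \mathit{opt}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $(\exists! C_{rj} : X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq C_{rj})$

$R$: $\mathrm{repetition}(\mathit{id}_a, \mathit{opt}_a, \mathit{min}_a, \mathit{max}_a, C_{a1}, C_{a2}, \ldots, C_{an}) \succeq \mathrm{repetition}(\mathit{id}_r, \mathit{opt}_r, \mathit{min}_r, \mathit{max}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{min}_a \ge \mathit{min}_r$, $\mathit{max}_a \le \mathit{max}_r$, $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$R_g$: $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq \mathrm{repetition}(\mathit{id}_r, \mathit{opt}_r, \mathit{min}_r, \mathit{max}_r, C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\mathit{min}_r \le 1$, $\mathit{max}_r \ge 1$, $\mathit{opt}_a \Rightarrow \mathit{opt}_r$, $Q_a(X_a(\mathit{id}_a, \mathit{opt}_a, \ldots)) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$

$F$: $\mathrm{ref}(\mathit{id}_a) \succeq \mathrm{ref}(\mathit{id}_r)$
    if $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$

$Q$: $Q_a(C_{a1}, C_{a2}, \ldots, C_{an}) \succeq Q_r(C_{r1}, C_{r2}, \ldots, C_{rm})$
    if $\forall C_{rj}(\ldots, \mathit{opt}_{rj}, \ldots) : [(\exists! C_{ai} : C_{ai} \succeq C_{rj})$ or $(\mathit{opt}_{rj})]$

Figure 11: Version 1 of the 'determines' relation $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots) \succeq X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$ between an actual schema
block $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots)$ and a required schema block $X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$. We use the non-XML notation from
Figure |]; in addition, $X_a(\mathit{id}_a, \mathit{opt}_a, \ldots)$ and $X_r(\mathit{id}_r, \mathit{opt}_r, \ldots)$ are shortcuts for any schema block (data blocks are never
optional and have empty ids). $\Rightarrow$ means logical implication and $\exists!$ means 'there exists a unique.' The rules are applied
top to bottom, left to right. The first matching rule wins (no backtracking). This definition will be later restricted to
make it computable and rule $Q$ will be extended to handle replacements.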
notion, and in the absence of a domain theory, there is no hard and fast measure of 'appropriateness.' Given two
slightly different schemas, only a domain expert can tell whether or not it is meaningful to attempt a conversion
from one form to another. Therefore, our conversion rules should be viewed as heuristics that we have found to
be useful enough to be supported in a conversion library. They are neither sound nor complete in an algorithmic
sense (because we do not have an objective, external measure of 'conversion correctness'). Instead, they represent
a tradeoff between soundness and completeness and should be carefully evaluated for use in a particular domain.
With this disclaimer in mind, version 1 of the determines relation between $S_a$ and $S_r$ ($S_a$ determines $S_r$; $S_a \succeq S_r$)
is defined in Figure 11. We will also find the notion of schema equivalence useful: we say that two schemas $S_a$ and
$S_r$ are equivalent if $S_a \succeq S_r$ and $S_r \succeq S_a$.

The first rule ($D_r$) in Figure 11, for instance, says that a value of primitive type ('data') can be substituted for
another if they have the same base type, their ranges are compatible, and they have the same units. It ensures that
all primitive type constraints of $S_r$ are met by $S_a$ (restriction). Thus, $D_r$ is simply a definition of type derivation
by range restriction (the 'r' subscript in this and other rules stands for restriction; similarly, the 'g' subscript stands
for generalization). Rules $E$, $P$, and $R$ state the obvious: two black boxes are compatible if they have compatible
wrappers (restriction) and compatible contents (any conversions). Rule $C$ says that any choice in $S_a$ must uniquely
determine some choice in $S_r$ (restriction). Rule $Q$ enforces that every block in $S_r$ is uniquely determined by some
block in $S_a$. This formulation of rule $Q$ ignores extra blocks in $S_a$ (restriction), permits optional elements in $S_r$ to
be unmatched (generalization), and allows for contents reordering. Rule $F$ deals with references. Only rules $D_r$,
$E$, $P$, $C$, and $R$ are sound. Rule $F$ looks sound, but it makes the determines relation not computable. Rule $Q$ is
unsound primarily because it ignores 'unnecessary' blocks in $S_a$.
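As a rough illustration of rules $D_r$ and $Q$, the Python sketch below checks the determines relation for flat sequences of leaf elements. It is a simplification under stated assumptions, not the paper's algorithm: the `Data` and `Leaf` classes and all function names are hypothetical, a leaf element stands for an element whose only content is one data block, and nested blocks, references, and the rule ordering of Figure 11 are ignored.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Data:
    base: str
    min: float
    max: float
    number: bool
    finite: bool
    units: str


@dataclass(frozen=True)
class Leaf:
    name: str
    opt: bool
    data: Data


def determines_data(a, r):
    """Rule D_r: same base type and units, and the actual range is no wider."""
    return (a.base == r.base and a.min >= r.min and a.max <= r.max
            and (not r.number or a.number)
            and (not r.finite or a.finite)
            and a.units == r.units)


def determines_leaf(a, r):
    """Simplified rule E for leaf elements: same name, opt_a implies opt_r."""
    return (a.name == r.name and (not a.opt or r.opt)
            and determines_data(a.data, r.data))


def determines_sequence(actual, required):
    """Rule Q: each required block has a unique actual match or is optional.
    Extra actual blocks are ignored and order does not matter."""
    for r in required:
        matches = [a for a in actual if determines_leaf(a, r)]
        if len(matches) != 1 and not r.opt:
            return False
    return True


STRING = Data("string", 1, inf, False, False, "")
RATIO = Data("ratio", -inf, inf, True, True, "dB")
actual = [Leaf("description", False, STRING), Leaf("id", False, STRING)]
required = [Leaf("id", False, STRING), Leaf("gain", True, RATIO)]
print(determines_sequence(actual, required))
```

The example reflects the behavior described above: the extra description block is ignored, the optional gain is allowed to be unmatched, and the order of blocks is irrelevant.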

Rules $E_g$, $P_g$, $C_g$, and $R_g$ handle generalizations across schema blocks of (possibly) different types. Their
counterparts $E_r$ and $P_r$ handle symmetric restrictions (why is there no $C_r$ or $R_r$?). Rule $C_g$ was demonstrated in
the example above. It is a base case for rule $C$. Rule $C_g$ states that one way to generalize a schema block is to
enclose it in a selection, i.e., provide more choices in $S_r$ than were available in $S_a$. This rule is sound. Rules $E_g$, $P_g$,
and $R_g$ have similar motivations, but they are unsound. Essentially, we assume that decorating any black box with
any number of wrappers does not change the meaning of the black box (generalization). Similarly, we assume that
wrappers can be freely removed to expose the black box (restriction).

Consider a sequence of schemas that describes some physical system in progressively greater detail. Suppose 
some subsystem is described by a single parameter. Common practice is to allocate a single schema block to this 
subsystem. What happens when a more detailed description of this subsystem is incorporated into the schema? 
Chances are, the original schema block allocated to the subsystem will be either (a) augmented with more contents
(restriction part of rule Q) or (b) wrapped in another block. The generalization and restriction rules handle
case (b). However, blind application of these rules can lead to disaster because these rules disregard some semantic 
information. Examples will make these points clearer. 



Example 4. One common trick used to improve wireless system performance is space-time transmit diversity
(STTD). Instead of a single transmitter antenna, the base station uses two transmitter antennas separated by a small
distance. PDPs are very sensitive to device positioning, so two uncorrelated transmitter antennas can produce widely
different signals at the same receiver location. If the signal from one of the antennas is weak, the signal from the other
antenna will probably be strong, so the overall performance is expected to improve. Consider how addition of STTD
to the ray tracer affects the schema of the transmitter file. The original schema is shown first, followed by the new
schema (with STTD support). The second antenna is optional because STTD is not used in every system due
to cost considerations.

<element name='tx'>
  <ref id='coordinates'/>
  <element name='power' type='power'/>
  <element name='freq' type='double'/>
</element>

<element name='base_station'>
  <element name='tx'>
    <ref id='coordinates'/>
    <element name='power' type='power'/>
    <element name='freq' type='double'/>
  </element>
  <element name='tx' optional='true'>
    <ref id='coordinates'/>
    <element name='power' type='power'/>
    <element name='freq' type='double'/>
  </element>
</element>

The new ray tracer should be able to work with old data because it supports one or two transmitter antennas. The
old ray tracer should be able to work with new data, although the results will be approximate when the new data contains
two transmitter antennas. Further generalizing this example to n transmitter antennas would require a repetition. We
support conversion to repetitions, but not from repetitions. For this example, we could extract any antenna because
they usually have the same parameters and are positioned close together. However, we cannot extract an arbitrary
ray from a PDP because the ray with maximum power is usually intended. Extracting any other ray would typically
produce nonsense results. □




Example 5. Havoc can result if rules $E_r$ and $E_g$ are applied to the same element. Element names have semantic
meaning, but this particular composition of rules allows arbitrary renaming of elements. Such renaming would make
the following two schemas equivalent.

<element name='tx_gain' type='ratio'/>
<element name='snr' type='ratio'/>

Even though both transmitter antenna gain and signal-to-noise ratio are ratios measured in the same units (dB),
they convey largely different information. We avoid such blatant mistakes by limiting the application of generalization
and restriction rules. In particular, no element can be renamed. □



As the last example illustrates, the 'determines' relation in Figure 11 needs to be restricted. It is helpful to
redefine this relation in terms of a context-free grammar that describes $S_a S_r$. Let the terminals be element(,
sequence(, selection(, repetition(, ref(, data(, ), and all element names and other values used
in the two schemas under consideration. Let the non-terminals be the labels of the rules in Figure 11, a special start
non-terminal $A$, and intermediate non-terminals introduced by the rules. We can formally define the necessary
restrictions by limiting the shape of the parse tree for $S_a S_r$. Consider a path $R_1, R_2, \ldots, R_n$, $n > 0$, from some
internal node $R_1 \neq A$ to some internal node $R_n \neq A$, where all $R_i$, $1 \le i \le n$, are rule labels. If $\mathcal{R}$ is the set of
restriction rules and $\mathcal{G}$ is the set of generalization rules, we require that ($R_i \in \mathcal{R}$) implies ($R_{i-1} \notin \mathcal{G}$ and
$R_{i+1} \notin \mathcal{G}$), i.e., restriction and generalization rules cannot be applied in sequence. This restriction of the parse
tree disallows renaming of elements, but does not limit the number of wrappers around black boxes. Bounded
determination deals with the latter problem. We say that $S_a$ k-determines $S_r$ ($S_a \succeq_k S_r$) if no path
$R_1, R_2, \ldots, R_n$ contains a substring of (possibly different) generalization (restriction) rules of length greater than k.
We leave it up to the reader to appropriately restrict rule F (reference). These restrictions make the 'determines'
relation computable and enforce locality of conversions. As a side effect, we have shown that the problem of
constructing a conversion schema $S_c$ from the actual schema $S_a$ and the required schema $S_r$ can be reduced to
validation and binding (parsing and translation). However, schema conversion need not work with streams of data,
so a parser more powerful than a predictive parser should be used.
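The two path restrictions can be checked mechanically once a parse-tree path is written down as a list of rule
labels. The Python sketch below is a minimal illustration under stated assumptions: the label spellings (`"Er"`,
`"Eg"`, and so on), the membership of the two sets, and the function name `path_allowed` are hypothetical, and the
sets list only the rules that the text explicitly names as restrictions or generalizations.

```python
# Assumed label sets; only rules the text names explicitly are listed.
RESTRICTION = frozenset({"Er", "Pr"})
GENERALIZATION = frozenset({"Eg", "Pg", "Cg", "Rg"})


def path_allowed(path, k, restriction=RESTRICTION, generalization=GENERALIZATION):
    """Return True if one path of rule labels satisfies both restrictions.

    1. A restriction rule is never adjacent to a generalization rule.
    2. No run of consecutive generalization (or restriction) rules is longer than k.
    """
    run_kind, run_len = None, 0
    for label in path:
        if label in restriction:
            kind = "r"
        elif label in generalization:
            kind = "g"
        else:
            run_kind, run_len = None, 0   # a neutral rule ends the current run
            continue
        if run_kind is None:
            run_kind, run_len = kind, 1   # a new run starts here
        elif kind == run_kind:
            run_len += 1                  # the run of the same kind grows
        else:
            return False                  # restriction next to generalization
        if run_len > k:
            return False                  # more than k wrappers in a row
    return True
```

For example, with k = 2 the path Q, Eg, Eg, E is accepted, while Q, Eg, Eg, Eg is rejected for exceeding the bound
and Er, Eg is rejected for any k because it would permit renaming.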
It remains to consider requirements 4 and 5: unit conversion and user-defined conversion filters (replacements).
Let $D$ be the set of all primitive types derived from double (recall that a primitive type is defined by the base type, the
range of legal values, and a unit expression). Unit conversion, e.g., converting kg/m$^2$ to lb/in$^2$, is the simpler of the
two replacements. Both actual and required unit expressions are converted to a canonical form (e.g., a fraction of
products of sums of SI units or dB) and then the conversion function is found. Unit conversions are functions of the
form

$$U : D_a \to D_r,$$

where $D_a, D_r \in D$ are specific primitive types. User-defined conversion filters are functions of the form

$$H : D_{a1} \times D_{a2} \times \cdots \times D_{an} \to D_{r1} \times D_{r2} \times \cdots \times D_{rm},$$

where $n, m > 0$ and all $D_{ai}, D_{rj} \in D$, $1 \le i \le n$, $1 \le j \le m$, are specific primitive types. Arithmetic operators and
common mathematical functions are allowed in user-defined conversion filters. Each user-defined conversion filter
is tagged with element names $name_{a1}, name_{a2}, \ldots, name_{an}$ and $name_{r1}, name_{r2}, \ldots, name_{rm}$ that determine
when the filter applies. Such filters define rules of the form

$$(\mathrm{element}(\$, \$, name_{a1}, D_{a1}), \mathrm{element}(\$, \$, name_{a2}, D_{a2}), \ldots, \mathrm{element}(\$, \$, name_{an}, D_{an})) \succeq$$
$$(\mathrm{element}(\$, \$, name_{r1}, D_{r1}), \mathrm{element}(\$, \$, name_{r2}, D_{r2}), \ldots, \mathrm{element}(\$, \$, name_{rm}, D_{rm})).$$

Both kinds of filters are compiled into codes such as shown in Figure 10. Rule Q is modified to take advantage of
replacements. Basically, we are looking for (unique) partitions of the actual schema blocks $C_{a1}, C_{a2}, \ldots, C_{an}$ and
required schema blocks $C_{r1}, C_{r2}, \ldots, C_{rm}$ such that each set of schema blocks in the required partition is determined
by some set of schema blocks in the actual partition. Determination can proceed through the rules in Figure 11, unit
conversions, and user-defined conversion filters (if everything else fails, optional blocks in the required schema can
remain unmatched).
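To make the idea of a canonical form concrete, the following Python sketch reduces a fraction of unit products to
an SI scale factor plus a map of dimension exponents, and then builds the conversion function $U$ from two such
forms. It is a simplified illustration, not the BSML implementation: the `BASE_UNITS` table and the function names are
hypothetical, and sums of units and dB are deliberately left out.

```python
# Hypothetical table: unit symbol -> (factor to SI, SI dimension exponents).
BASE_UNITS = {
    "kg": (1.0, {"kg": 1}),
    "m": (1.0, {"m": 1}),
    "lb": (0.45359237, {"kg": 1}),   # international pound in kilograms
    "in": (0.0254, {"m": 1}),        # international inch in meters
}


def canonical(numerator, denominator):
    """Reduce a fraction of unit products to (SI factor, dimension map)."""
    factor, dims = 1.0, {}
    for units, sign in ((numerator, 1), (denominator, -1)):
        for unit in units:
            unit_factor, unit_dims = BASE_UNITS[unit]
            factor *= unit_factor ** sign
            for name, power in unit_dims.items():
                dims[name] = dims.get(name, 0) + sign * power
    # Drop cancelled dimensions so that equal dimensions compare equal.
    return factor, {name: p for name, p in dims.items() if p != 0}


def unit_conversion(actual, required):
    """Build U : D_a -> D_r for two (numerator, denominator) unit expressions."""
    actual_factor, actual_dims = canonical(*actual)
    required_factor, required_dims = canonical(*required)
    if actual_dims != required_dims:
        raise ValueError("unit expressions have different dimensions")
    return lambda x: x * actual_factor / required_factor


# kg/m^2 -> lb/in^2, the example used in the text
kg_m2_to_lb_in2 = unit_conversion((["kg"], ["m", "m"]), (["lb"], ["in", "in"]))
```

Comparing dimension maps before dividing the scale factors is what lets such a sketch refuse a conversion between,
say, a mass and a length instead of silently producing a number.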

The ultimate goal of the conversion algorithm is to find a meaningful edit script. However, this goal is impossible
to achieve without knowledge of the domain. What happens when several edit scripts exist, i.e., the problem of
finding an edit script is ambiguous? Depending on the nature of the ambiguity, we can choose any edit script, choose
the minimal (in some sense) edit script, or refuse to perform the conversion. The conversion algorithm described here
either settles for some local minimum (e.g., rule E is preferred over rule $E_g$) or requires uniqueness of conversions
(rules C, $C_g$, and most of rule Q). Ambiguity remains an open problem that is unlikely to be solved by a syntactic
conversion algorithm. Following the principle of least user astonishment, we choose to reject most ambiguous
conversions.
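The policy of rejecting ambiguous conversions can be illustrated with a deliberately simplified stand-in for rule Q.
In the Python sketch below, blocks are plain names, the `determines` predicate is supplied by the caller, and the
function `match_blocks` is hypothetical. It ignores partitions, replacements, and nesting, and only shows how
uniqueness, optional blocks, extra blocks, and reordering interact.

```python
from typing import Callable, Dict, List, Optional, Tuple


def match_blocks(
    actual: List[str],
    required: List[Tuple[str, bool]],
    determines: Callable[[str, str], bool],
) -> Optional[Dict[int, int]]:
    """Map each required block index to the actual block index that determines it.

    `required` holds (name, optional) pairs. Returns None when the
    conversion is rejected as ambiguous or incomplete.
    """
    mapping: Dict[int, int] = {}
    used = set()
    for j, (req, optional) in enumerate(required):
        candidates = [
            i for i, act in enumerate(actual)
            if i not in used and determines(act, req)
        ]
        if len(candidates) > 1:
            return None          # ambiguous: several edit scripts exist
        if not candidates:
            if optional:
                continue         # optional required blocks may stay unmatched
            return None          # a mandatory block cannot be produced
        mapping[j] = candidates[0]
        used.add(candidates[0])
    return mapping               # extra actual blocks are simply ignored
```

Because candidates are searched over all unused actual blocks, the order of blocks in the two schemas does not
matter, which corresponds to the reordering permitted by rule Q.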

Finally, let us consider how binding codes limit conversion. We omit formal treatment of the problem and limit 
the discussion to an example. It is easy to see that conversion may require delaying binding code execution. This 
should not be surprising since one kind of conversion is reordering. 



Example 6. Consider a required schema with binding codes (shown first) and an actual schema (shown second).

<sequence>
  <element name='a' type='double'/>
  <code>c1</code>
  <repetition>
    <ref id='b'/>
    <code>c2</code>
  </repetition>
</sequence>

<sequence>
  <repetition><ref id='b'/></repetition>
  <element name='x' type='double'/>
  <element name='y' type='double'/>
</sequence>

Assume that there exists a user-defined conversion filter that calculates a from x and y. If we ignore binding
code c2, conversion is clearly local. However, conversion with c2 present will require delaying all executions of
c2 until c1 is executed. The latter can only happen when the last piece of the schema is matched. In other words,
binding codes should be placed as late as possible in the schema. □
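The delay in Example 6 can be simulated directly. The Python sketch below is only an illustration: the event
representation, the function `run_bindings`, and the filter `compute_a` are hypothetical, and binding codes are
modeled as log entries rather than executed code. It streams the actual data in its own order and records the order
in which c1 and c2 could run.

```python
def run_bindings(actual_events, compute_a):
    """Replay binding codes in required-schema order over an actual data stream.

    `actual_events` is a sequence of (name, value) pairs in actual-schema order.
    Executions of c2 are buffered until c1 has run, and c1 can run only after
    both x and y have been seen.
    """
    executed, delayed, seen = [], [], {}
    c1_done = False
    for name, value in actual_events:
        if name == "b":
            if c1_done:
                executed.append(("c2", value))
            else:
                delayed.append(value)        # c2 must wait for c1
        else:
            seen[name] = value
            if not c1_done and "x" in seen and "y" in seen:
                a = compute_a(seen["x"], seen["y"])
                executed.append(("c1", a))   # c1 fires once a is available
                c1_done = True
                executed.extend(("c2", b) for b in delayed)
                delayed.clear()
    return executed
```

With the actual schema of Example 6, every b value precedes x and y, so the whole repetition ends up in the buffer.
This is the cost that placing binding codes late in the schema is meant to avoid.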

This section presented a number of local conversions appropriate for PSE data. Conversions are carried out 
by extra codes injected in the actual schema. The conversion algorithm was built around the 'determines' relation 
between schemas. The algorithm has some technical limitations related to binding codes, but its major limitation 
is conceptual. Conversion, in the form presented here, is syntactic. It is based on the weak semistructured data 
model, not on the underlying domain theory (wireless communications). Therefore, we can only speculate about the 
causes of differences between the actual and required schemas. There is no guarantee that automatic conversion will 
produce meaningful results. A stronger data model is necessary to perform complex, yet meaningful, conversions. 



6 Integration with a PSE 

A complete PSE requires functionality far beyond validation, binding, and conversion. BSML ensures that the
components can read streams of XML data, but it does not support tasks such as scheduling, communication, database
storage and retrieval, connecting multiple components into a given topology, and computational steering. We broadly
call software that performs all of these tasks an execution manager. Figure 12 illustrates how BSML software and
the execution manager function together.

From a systems point of view, BSML schemas are metadata and the BSML software is a parser generator. Recall 
that the parser generator generates parsers that perform validation, binding, and conversion functions (every such 
generated parser will be able to take input data and stream it through the component). Both the data and the metadata 
are stored in a database. We can distinguish three kinds of metadata: schemas, component metadata, and model 
instance metadata. Only one form of metadata (schemas) was described in this paper. Component metadata contains
a component's local parameters, such as executable name, programming language, and input/output port schemas. It
is the kind of metadata used in CCAT. Model instance metadata, i.e., component topology and other global execution 
parameters, serves a purpose similar to GALE's workflow specifications. It supports our requirement 3. 

A parser is lazily generated for each combination in use of a component's input port schema (required schema) and
the schema of the data instance connected to this port (actual schema). Component metadata specifies how linking
must be performed (e.g., which of the three kinds of bindings to use). Component instances are further managed 
by the execution manager. Model instance metadata specifies how to execute the model instance (e.g., the topology 
and the number of processors), while model instance data serves as the actual (data) input to the model instance. To 
summarize, the BSML parser generator creates component instances — programs that take a number of XML streams 
as inputs and produce a number of XML streams as outputs. This representation is appropriate for management of a 
PSE execution environment. 
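Lazy generation amounts to memoizing the parser generator on the pair (required schema, actual schema). The
Python sketch below shows one way this could look. The class `ParserCache`, its method names, and the use of
schema ids as keys are assumptions made for illustration and are not taken from the prototype.

```python
class ParserCache:
    """Generate a parser the first time a (required, actual) schema pair is used."""

    def __init__(self, generate):
        # `generate` is any callable (required_id, actual_id) -> parser;
        # it stands in for the BSML parser generator.
        self._generate = generate
        self._parsers = {}

    def get(self, required_id, actual_id):
        key = (required_id, actual_id)
        if key not in self._parsers:
            # First use of this combination: run the (expensive) generator.
            self._parsers[key] = self._generate(required_id, actual_id)
        return self._parsers[key]

    def __len__(self):
        return len(self._parsers)
```

Combinations that never occur in a model instance are never generated, so the cost of generation is paid only for
the port connections that are actually exercised.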



6.1 Status of Prototype 

In S$^4$W, the execution manager is implemented in Tcl/Tk and most of the component metadata is hard-coded. Model
instance metadata consists primarily of the number of processors and a cross-product of references to model instance 
data. An (incomplete) example of such a specification is 

'compute power coverage maps for these three transmitter locations in Torgersen Hall and show a graph 
of BERs with the signal-to-noise ratio varying from zero to twenty dB in steps of two dB; use thirty 
nodes of a 200-node Beowulf cluster.' 

PostgreSQL and the filesystem serve the role of the database. Large files (e.g., floor plans) are typically stored in the
filesystem and small ones (e.g., PDPs) are usually imported into PostgreSQL. The parser generator is written in SWI
Prolog. It generates parsers in Tcl. Currently, these parsers are used mostly in the execution manager, visualization
components, and database interfacing components.

Figure 12: BSML integration with PSE execution environment. The BSML parser generator creates parsers that
handle input ports of each component. Execution manager controls the execution of a model instance that consists
of components, model instance data, and model instance metadata. Figure 1 partially defines one such instance.

7 Discussion 

We have described the use of validation, binding, and conversion facilities to solve data interchange problems in a 
PSE. Since all three concepts are closely related to parsing and translation, viewing application composition in terms 
of data management uncovers well-understood solutions to interface mismatch problems. The semistructured data 
model allows us to syntactically define several forms of conversions that are usually implemented by hand-written 
mediators in PSEs. Such automation reduces the cost of PSE development and, more importantly, brings PSEs closer 
to their ultimate goal — namely, PSE users should be solving their domain-specific problems, not be beset by the 
technical details of component composition in a heterogeneous computing environment. 

Several extensions to the present work are envisioned. First, the expressiveness of schema languages for data 
interchange and application composition can be formally characterized. This will allow us to reason about requirements
such as stream processing from a modeling perspective. Such a study will also lead to a better understanding
of the roles that a markup language can play in a PSE. Second, dataflow relationships between components can be
made explicit. BSML guarantees that any component instance is able to process streams of data, but synchronization
issues are meant to be resolved by the execution manager. Tighter integration of BSML and composition frameworks 
can be explored. Finally, the overall view of a PSE as a semistructured data management system deserves further 
exploration. For example, it seems possible to automatically generate workflow specifications from queries on a 
semistructured database of simulation results. 

Any good problem solving facility is characterized by 'what it lets you get away with.' BSML is unique among 
PSE projects in that it allows a modeler or engineer to flexibly incorporate application-specific considerations for 
data interchange, without insisting on an implementation vocabulary for components.

A BSML DTD

<!ENTITY % boolean "(true | false | t | f | yes | no | y | n)">

<!-- attributes of primitive types:
     min    - minimum value or string length (inclusive)
     max    - maximum value or string length (inclusive)
     number - true means NaN is not allowed (doubles only)
     finite - true means +/-infinity is not allowed (doubles only)
     units  - units for this type (doubles only)
-->
<!ENTITY % type_attributes "
    min     CDATA      #IMPLIED
    max     CDATA      #IMPLIED
    number  %boolean;  #IMPLIED
    finite  %boolean;  #IMPLIED
    units   CDATA      #IMPLIED
">

<!-- what schemas and schema blocks are composed of -->
<!ENTITY % schema_contents "
    (element | sequence | selection | repetition)
">
<!ENTITY % block_contents "
    (%schema_contents; | default | ref | code)
">

<!-- a collection of schemas -->
<!ELEMENT schemas ((description)?, (type | schema)*)>
<!ATTLIST schemas>

<!-- primitive type: attributes above and an optional
     enumeration of legal values; derivation works by restriction;
     builtin base types are: integer, string, double, boolean -->
<!ELEMENT type ((description)?, (values)?)>
<!ATTLIST type
    id    CDATA  #REQUIRED
    base  CDATA  #REQUIRED
    %type_attributes;
>

<!-- enumeration of legal values, no value is legal if empty -->
<!ELEMENT values ((value)*)>
<!ATTLIST values>
<!ELEMENT value (#PCDATA)>
<!ATTLIST value>

<!-- schema -->
<!ELEMENT schema ((description)?, (code)*, (%schema_contents;), (code)*)>
<!ATTLIST schema
    id  CDATA  #REQUIRED
>

<!-- an element can contain either
     (a) character data of a primitive type (type attribute is present),
     (b) zero or more schema blocks (type attribute is absent), or
     (c) when type='*', any contents.
-->
<!ELEMENT element ((description)?, (attribute)*,
                   ((values)? | (%block_contents;)*))>
<!ATTLIST element
    name      CDATA      #REQUIRED
    id        CDATA      #IMPLIED
    optional  %boolean;  "false"
    type      CDATA      #IMPLIED
    %type_attributes;
    default   CDATA      #IMPLIED
>

<!-- an attribute must contain a value of some primitive type -->
<!ELEMENT attribute ((description)?, (values)?)>
<!ATTLIST attribute
    name     CDATA  #REQUIRED
    id       CDATA  #IMPLIED
    type     CDATA  "string"
    %type_attributes;
    default  CDATA  #IMPLIED
>

<!-- a sequence is just a grouping, for convenience -->
<!ELEMENT sequence ((description)?, (%block_contents;)*)>
<!ATTLIST sequence
    id        CDATA      #IMPLIED
    optional  %boolean;  "false"
>



<!-- a selection denotes a mutually exclusive choice of contents -->
<!ELEMENT selection ((description)?, (%block_contents;)+)>
<!ATTLIST selection
    id        CDATA      #IMPLIED
    optional  %boolean;  "false"
>

<!-- a repetition denotes [min..max] repetitions of contents -->
<!ELEMENT repetition ((description)?, (%block_contents;)*)>
<!ATTLIST repetition
    id        CDATA      #IMPLIED
    optional  %boolean;  "false"
    min       CDATA      "0"
    max       CDATA      "inf"
>



<!-- a reference to some block id in this schema,
     or to an id of a different schema -->
<!ELEMENT ref ((description)?)>
<!ATTLIST ref
    id  CDATA  #REQUIRED
>

<!-- user code; language and component attributes facilitate
     schema reuse (different components can have the same schema,
     but different binding codes) -->
<!ELEMENT code (#PCDATA)>
<!ATTLIST code
    language   CDATA  #IMPLIED
    component  CDATA  #IMPLIED
>

<!-- default contents must conform to BSML schema block -->
<!ELEMENT default ANY>

<!-- XHTML usually goes here -->
<!ELEMENT description ANY>